Skip to content

feat(compression): enable dictionary compression in pure Rust backend#229

Merged
polaz merged 31 commits into
mainfrom
feat/#218-featcompression-enable-dictionary-compression-in-p
Apr 9, 2026
Merged

feat(compression): enable dictionary compression in pure Rust backend#229
polaz merged 31 commits into
mainfrom
feat/#218-featcompression-enable-dictionary-compression-in-p

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented Apr 7, 2026

Summary

  • Implements `compress_with_dict()` in `ZstdPureProvider` using `FrameCompressor` from `structured-zstd` v0.0.11 (feat: dictionary builder — FastCOVER algorithm and dictionary finalization structured-zstd#25)
  • Removes the C FFI `zstd` crate entirely; `structured-zstd` is now the sole backend under the `zstd` feature flag
  • `zstd-pure` becomes a deprecated alias (`zstd-pure = ["zstd"]`) — enabling it is equivalent to enabling `zstd`
  • Supports both finalized zstd dictionaries (magic bytes `37 A4 30 EC` + entropy tables) and raw content dictionaries
  • TLS caching: single-entry `FrameCompressor` / `FrameDecoder` per thread, keyed by 64-bit xxh3 fingerprint + level
  • `strip_dict_id`: rewritten in-place (`get_mut` + `copy_within` + `truncate`) — eliminates O(frame_len) allocation per compressed block
  • Bug fix: `decode_raw_content_bounded` with `capacity=0` and an empty frame no longer incorrectly returns `DecompressedSizeTooLarge`; uses `remaining.max(1)` in `UptoBytes` so the decoder advances past the empty Last_Block before capacity is checked
  • CI: `test-zstd-pure` → `test-zstd`, matrix extended to include MSRV 1.92.0

Technical Details

Format detection: `compress_with_dict` and `decompress_with_dict` check for the zstd dictionary magic prefix (bytes `37 A4 30 EC`, little-endian `0xEC30A437`):

  • Finalized dict → `Dictionary::decode_dict` (entropy tables + content)
  • Raw content dict → `Dictionary::from_raw_content` with ID = lower 32 bits of xxh3, clamped to ≥1 (id=0 is reserved in the zstd frame format)

Raw-content dict ID stripping: `compress_with_dict` strips the synthetic dictID from the frame header after compression. This matches the zstd standard convention where `dictID=0` (absent) means "raw-content dict, id unknown, accept any", preventing decompressors from requiring the specific synthetic id.

Decompression-bomb guard: The raw-content dict decompress path calls `decoder.content_size()` after `init()` and rejects frames whose declared size exceeds the caller's capacity limit before allocating the output buffer. Frames without the FCS field fall back to the post-decode check.

`ZstdDictionary::id()`: Returns the raw lower 32 bits of xxh3 (may theoretically be 0). Config validation paths compare `dict.id()` against on-disk `dict_id` — both sides derive the same value, so validation is unaffected. The `.max(1)` clamp is only applied inside the backend when embedding an id in a zstd frame header.

Blocker resolved: structured-world/structured-zstd#25 (FastCOVER + dictionary finalization) merged; `structured-zstd` bumped to v0.0.11 (encoding performance improvements: row-based match finder, HC positions rebase, streaming scratch buffer reuse, FSE decoder packing, and HC table improvements).

Known Limitations

Test Plan

  • `src/compression/zstd_pure.rs` — roundtrip, magic detection, all levels, empty input, raw content dict, capacity guard, in-place strip_dict_id, empty-frame-at-capacity-0 regression
  • `tests/zstd_dict_roundtrip.rs` — full Tree write/flush/read, range scan, reopen, missing/wrong dict errors, finalized dict, encryption, per-level policy
  • `tests/zstd_dict_roundtrip.rs` — compaction path: 3 L0 SSTs flushed → `major_compact` → 300 keys readable + range scan (exercises both `compress_with_dict` and `decompress_with_dict` on the compaction hot path)
  • `tests/zstd_dict_roundtrip.rs` — reopen with wrong dict fails at recovery (`ZstdDictMismatch` on first SST read)
  • `cargo nextest run --no-default-features --features zstd,lz4`: 1208 passed
  • `cargo nextest run --all-features`: 1256 passed
  • `cargo clippy --all-features -- -D warnings`: clean
  • `cargo clippy --no-default-features --features zstd,lz4 --all-targets -- -D warnings`: clean
  • CI job `test-zstd` (renamed from `test-zstd-pure`) runs on `[stable, "1.92.0"]`

Closes #218

Summary by CodeRabbit

  • New Features

    • Added support for zstd dictionary compression and decompression (pure-Rust backend).
  • Documentation

    • Clarifies zstd now uses a pure‑Rust backend (no C toolchain), documents performance tradeoffs, current limitations, and marks the old alias as deprecated.
  • Tests

    • Substantially expanded unit and integration tests for dictionary behaviors and round‑trip/compaction scenarios.
  • Chores

    • CI and benchmark workflow updates; dependency and build-script feature wiring adjusted.

- Implement `compress_with_dict()` in `ZstdPureProvider` using
  `FrameCompressor::set_dictionary()` from `structured-zstd` v0.0.7,
  which added dictionary encoder support in structured-world/structured-zstd#25
- Support both finalized zstd dictionaries (magic 0x37A430EC + entropy
  tables, parsed via `decode_dict`) and raw content dictionaries (bare
  byte sequences, parsed via `from_raw_content` with xxh3-derived ID),
  matching the transparent handling in the C FFI backend
- Remove the config-time guard that rejected `ZstdDict` compression
  policies under the pure backend — the feature is now fully supported
- Update `decompress_with_dict` to apply the same format-detection
  logic so raw content dicts round-trip correctly end-to-end

Unit tests (src/compression/zstd_pure.rs):
  - compress_with_dict_roundtrip_pure_to_pure
  - compress_with_dict_produces_zstd_magic
  - compress_with_dict_roundtrip_all_levels (levels 1, 3, 9, 19)
  - compress_with_dict_empty_dict_returns_error
  - compress_with_dict_raw_content_dict_works
  - compress_with_dict_empty_plaintext_roundtrips

Integration tests (tests/zstd_dict_roundtrip.rs, zstd_pure_dict mod):
  - pure_tree_write_flush_read_zstd_dict (full Tree write→flush→read)
  - pure_tree_range_scan_with_zstd_dict
  - pure_tree_reopen_with_dict_reads_back_correctly
  - pure_zstd_dict_missing_returns_error

Closes #218
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 7, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 08b341ed-927a-4b05-9a14-033d60327186

📥 Commits

Reviewing files that changed from the base of the PR and between 5f2e3a0 and eb34b7f.

📒 Files selected for processing (1)
  • src/compression/zstd_pure.rs

📝 Walkthrough

Walkthrough

Replaced the C-FFI zstd backend with a single pure‑Rust structured‑zstd backend; implemented dictionary compression/decompression (finalized vs raw-content) with TLS caching and compressor cache; simplified ZstdDictionary API; removed zstd_ffi; updated Cargo/features, build script, CI, README, benchmarks, and tests.

Changes

Cohort / File(s) Summary
Feature & dependency wiring
Cargo.toml, build.rs
zstd now maps to optional structured-zstd; zstd-pure aliased to zstd; removed separate optional zstd dependency entry; build.rs emits zstd_any only when CARGO_FEATURE_ZSTD is set.
Pure Rust backend (dictionary support)
src/compression/zstd_pure.rs
Added DICT_MAGIC detection, strip_dict_id, bounded raw-content decoding, do_decompress_with_dict, full compress_with_dict implementation (FrameCompressor), TLS single-entry compressor cache keyed by (xxh3_64(dict_raw), level), raw vs finalized-dict paths, and extensive unit tests.
Compression module & types
src/compression/mod.rs
Removed C-FFI backend selection; ZstdBackend now resolves to pure‑Rust provider under zstd; removed prepared decoder field and accessor from ZstdDictionary; adjusted docs and id64 cfg visibility.
Removed C‑FFI backend
src/compression/zstd_ffi.rs
Deleted the C-FFI zstd backend and its ZstdFfiProvider implementation.
Configuration validation
src/config/mod.rs
Config::validate_zstd_dictionary() no longer rejects ZstdDict at compile-time under pure-Rust; now computes dict_id and validates presence/match for referenced dictionaries.
CI / Coverage / Benchmarks
.github/workflows/coordinode-ci.yml, .github/workflows/benchmark.yml, benches/zstd_dict.rs
Added test-zstd job and zstd-specific coverage run; updated benchmark checkout to actions/checkout@v6; renamed benchmark label to tls_hit and adjusted comments.
Docs & README
README.md
Updated feature-flag docs: zstd described as pure‑Rust structured-zstd backend (no C deps), performance note added; zstd-pure deprecated as alias.
Tests / Integration
tests/zstd_dict_roundtrip.rs
Added integration test zstd_dict_survives_major_compaction verifying Zstd-dict compression survives major compaction and validates reads and range scans.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant ZstdPure as ZstdPureProvider
    participant TLSCache
    participant Compressor as FrameCompressor
    participant Decoder as FrameDecoder

    rect rgba(200,200,255,0.5)
    Caller->>ZstdPure: compress_with_dict(input, dict_raw, level)
    ZstdPure->>TLSCache: lookup (xxh3_64(dict_raw), level)
    TLSCache-->>ZstdPure: compressor (create if missing)
    ZstdPure->>Compressor: compress(input, dict)
    Compressor-->>ZstdPure: compressed_frame
    ZstdPure->>ZstdPure: if raw-content dict → strip_dict_id(frame)
    ZstdPure-->>Caller: compressed_frame
    end

    rect rgba(200,255,200,0.5)
    Caller->>ZstdPure: decompress_with_dict(frame, dict_raw, capacity)
    ZstdPure->>ZstdPure: detect dict format (DICT_MAGIC / raw)
    alt raw-content path
        ZstdPure->>Decoder: init with force_dict + bounded decode
        Decoder-->>ZstdPure: decompressed_bytes / size_error
    else finalized-dict path
        ZstdPure->>Decoder: decode_all_to_vec(finalized)
        Decoder-->>ZstdPure: decompressed_bytes
    end
    ZstdPure-->>Caller: decompressed_bytes / error
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • #218: feat(compression): enable dictionary compression in pure Rust backend — This PR implements pure‑Rust dictionary compression/decompression requested by that issue.
  • #25: feat: dictionary builder — FastCOVER algorithm and dictionary finalization — The PR consumes finalized-dictionary format expectations (finalized dicts used by the compressor).
  • #232: (prepared-decoder caching proposal) — The PR replaces the prepared-field with TLS caching keyed by a 64-bit fingerprint, directly addressing the related caching/design surface.

Possibly related PRs

Poem

🐇
I hopped from FFI hollows to structured‑zstd bright,
Pure Rust frames now hum through day and night,
Dicts compress and decompress with care,
TLS caches warm, IDs stripped in the air,
Tests clap their paws — round trips pass just right.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(compression): enable dictionary compression in pure Rust backend' accurately describes the main change—implementing compress_with_dict() in ZstdPureProvider using structured-zstd.
Linked Issues check ✅ Passed The PR fully addresses all coding requirements from linked issues: #25 (FastCOVER/dictionary finalization in upstream structured-zstd, now resolved) and #218 (implement compress_with_dict in pure backend with round-trip compatibility and integration tests).
Out of Scope Changes check ✅ Passed All changes are directly related to enabling dictionary compression in the pure Rust backend: structured-zstd bump, C-FFI backend removal, ZstdPureProvider implementation updates, test additions, and CI/build configuration adjustments are all in scope.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/#218-featcompression-enable-dictionary-compression-in-p

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 97.39583% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/compression/zstd_pure.rs 97.39% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/compression/mod.rs`:
- Around line 38-43: The documentation for the compression function is too
strict: update the doc comment that currently says "must be a finalized zstd
dictionary" to state that dict_raw accepts both finalized zstd dictionaries
(magic 0x37A430EC header, entropy tables, content) and bare content dictionaries
as produced/returned by ZstdDictionary::new() / ZstdDictionary::raw(); mention
that both the C FFI backend and the pure Rust backend accept both formats so
callers know either form is valid.

In `@src/compression/zstd_pure.rs`:
- Around line 111-118: The frame encoder currently clamps the xxh3 low-32-bit
hash to 1 locally (let id = { xxh3_64(dict_raw) as u32; h.max(1) }) causing
mismatch with ZstdDictionary::id() and CompressionType::ZstdDict; instead
compute a single normalized dict id from the raw hash in the shared
dictionary-ID path (e.g., produce raw_id = (xxh3_64(dict_raw) as u32) and
normalized_id = raw_id.max(1)) and use normalized_id everywhere metadata,
ZstdDictionary::id(), CompressionType::ZstdDict, and frame encoding read/write
IDs; remove the local h.max(1) clamping in the encoder and apply the same change
for the other occurrence around lines 179-183 so both metadata and frame headers
observe the identical normalized id.

In `@tests/zstd_dict_roundtrip.rs`:
- Around line 270-271: The zstd_pure_dict module is gated with #[cfg(all(feature
= "zstd-pure", not(feature = "zstd")))] and will be skipped by cargo test
--all-features, so add a dedicated CI job that runs tests with the pure backend
enabled (e.g., run cargo test --no-default-features --features zstd-pure or
cargo test --features zstd-pure while ensuring the "zstd" feature is not
enabled) so the zstd_pure_dict module is exercised; update your CI workflow to
include this new job (name it clearly, e.g., "test: zstd-pure") and ensure it
uses the same test matrix/platforms as the other test jobs.
- Around line 278-291: The current make_test_dictionary() only wraps raw sample
bytes with ZstdDictionary::new(), so tests never exercise the
finalized-dictionary path; change or add a test fixture that actually
trains/finalizes a dictionary (use the library's training/finalize API instead
of ZstdDictionary::new()), then use that finalized dictionary in an end-to-end
encode/decode test to exercise Dictionary::decode_dict and
FrameCompressor::set_dictionary_from_bytes; locate make_test_dictionary and
replace or add a make_finalized_test_dictionary that invokes the training
function (or calls the dictionary finalization method) on the same samples and
return the finalized ZstdDictionary for the new test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9f444fc6-12f8-4148-ae09-8969ce781690

📥 Commits

Reviewing files that changed from the base of the PR and between ab61d33 and 746a513.

📒 Files selected for processing (5)
  • README.md
  • src/compression/mod.rs
  • src/compression/zstd_pure.rs
  • src/config/mod.rs
  • tests/zstd_dict_roundtrip.rs
💤 Files with no reviewable changes (2)
  • README.md
  • src/config/mod.rs

Comment thread src/compression/mod.rs Outdated
Comment thread src/compression/zstd_pure.rs Outdated
Comment thread tests/zstd_dict_roundtrip.rs Outdated
Comment thread tests/zstd_dict_roundtrip.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enables zstd dictionary compression for the zstd-pure backend (structured-zstd), removing the previous config-time rejection and adding unit + integration tests to exercise dictionary-compressed Tree roundtrips under the pure backend.

Changes:

  • Implement compress_with_dict() for ZstdPureProvider, including finalized-vs-raw dictionary format detection.
  • Remove the Config::open() guard that previously rejected CompressionType::ZstdDict under zstd-pure.
  • Add new unit tests (src/compression/zstd_pure.rs) and integration tests (tests/zstd_dict_roundtrip.rs) for pure-backend dictionary compression.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/compression/zstd_pure.rs Implements pure-backend dictionary compression and expands unit test coverage.
src/config/mod.rs Removes the pure-backend dictionary-compression rejection during config validation.
src/compression/mod.rs Updates trait-level docs to reflect dictionary support in the pure backend.
tests/zstd_dict_roundtrip.rs Adds integration tests for dict compression via the pure backend through the Tree API.
README.md Updates documentation by removing the old “dict compression unsupported” limitation for zstd-pure.

Comment thread src/compression/mod.rs Outdated
Comment thread tests/zstd_dict_roundtrip.rs Outdated
…-zero

- Move .max(1) into ZstdDictionary::id() so config validation, stored
  metadata, and zstd frame headers all observe the same dict id
  (id=0 is invalid in the zstd frame format; edge case when xxh3
  lower 32 bits are zero)
- Remove redundant .max(1) from decompress_with_dict raw-content path
- Update compress_with_dict trait doc to reflect that both finalized
  and raw-content dictionaries are accepted by both backends
- Add dedicated test-zstd-pure CI job to exercise the pure backend
  independently (--all-features enables C zstd and skips the
  not(feature = "zstd") gate)
- Add pure_finalized_dict_roundtrip integration test to cover the
  Dictionary::decode_dict path in the pure Rust backend
- Correct make_test_dictionary() doc: builds a raw-content dict, not
  a finalized-dictionary fixture
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comment thread src/compression/mod.rs Outdated
Comment thread src/compression/zstd_pure.rs Outdated
Comment thread .github/workflows/coordinode-ci.yml Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/coordinode-ci.yml:
- Around line 62-80: Add a new step in the existing test-zstd-pure job to run
clippy under the pure-only feature combo: insert a step before the "Run tests"
step that runs `cargo clippy --no-default-features --features zstd-pure,lz4
--all-targets -- -D warnings`; this ensures code behind #[cfg(all(feature =
"zstd-pure", not(feature = "zstd")))] (e.g. src/compression/zstd_pure.rs and
tests/zstd_dict_roundtrip.rs) is linted and won't introduce warnings.

In `@src/compression/mod.rs`:
- Around line 158-176: The current normalization in ZstdDictionary::id() forces
id=0 to 1 and breaks read compatibility: revert or limit that normalization so
existing stored metadata with dict_id == 0 can still be recognized;
specifically, keep ZstdDictionary::id64() unchanged, change ZstdDictionary::id()
to return the raw lower 32 bits (including 0) for read/validation paths and
perform the clamp-to-1 only in write/encoding code paths that emit new frame
headers (i.e., where CompressionType::ZstdDict is serialized), or alternatively
make validation logic that compares dictionaries accept both raw 0 and
normalized 1 as equivalent for the same dictionary; update the code paths that
currently call ZstdDictionary::id() for config validation and block/table
open-time checks to use the raw value or the tolerant comparison, and add a
regression test that writes metadata with dict_id == 0 and verifies it can be
read back without triggering ZstdDictMismatch.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d7702fe5-b3ec-47af-9ebb-251245a313de

📥 Commits

Reviewing files that changed from the base of the PR and between 746a513 and aacc1f1.

📒 Files selected for processing (4)
  • .github/workflows/coordinode-ci.yml
  • src/compression/mod.rs
  • src/compression/zstd_pure.rs
  • tests/zstd_dict_roundtrip.rs

Comment thread .github/workflows/coordinode-ci.yml Outdated
Comment thread src/compression/mod.rs Outdated
- Add TLS FrameCompressor cache to compress_with_dict in pure Rust backend,
  matching the existing FrameDecoder TLS cache in decompress_with_dict
- Cache key: (xxh3_64(dict_raw), level); dictionary is parsed at most once
  per thread per distinct (dict, level) pair
- Use owned Cursor<Vec<u8>> source and Vec<u8> drain for 'static TLS compat
- Revert ZstdDictionary::id() to return raw lower-32-bits of xxh3 (no clamping);
  backends that embed dict_id in the zstd frame header clamp to .max(1) themselves
  to preserve backward compat with on-disk dict_id=0 (theoretical)
- Update id() doc to clarify that clamping is the backend's responsibility
- Restore dict.id().max(1) in decompress_with_dict raw-content path
- Add cargo clippy step for pure backend in test-zstd-pure CI job
- Update benchmark.yml actions/checkout from v4 to v6
Comment thread .github/workflows/coordinode-ci.yml Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/compression/mod.rs:178

  • PR description says ZstdDictionary::id() is normalized to always return ≥1, but the current implementation returns the raw lower 32 bits of xxh3 and the docs explicitly note it may be 0. Since id=0 is reserved/invalid in zstd frame headers, leaving id() able to return 0 means on-disk CompressionType::ZstdDict{dict_id} can be 0 while the pure backend clamps the frame header dict ID to ≥1 for raw-content dictionaries. Please either implement the promised normalization (and adjust docs/tests accordingly) or update the PR description to match the intended semantics.
    /// Returns a 32-bit dictionary fingerprint (lower 32 bits of xxh3).
    ///
    /// Intended for config validation (matching a `CompressionType::ZstdDict`
    /// `dict_id` against the supplied `ZstdDictionary`) and external interop.
    ///
    /// The value is the raw lower 32 bits of xxh3 and may theoretically be `0`
    /// (probability ≈ 1/2³²). Backends that embed a dict ID in the zstd frame
    /// header (where id=0 is reserved) are responsible for clamping to at
    /// least 1 themselves. Config validation is unaffected: both sides derive
    /// the ID from the same bytes and therefore agree even in the zero case.
    ///
    /// For internal cache keying use [`id64`](ZstdDictionary::id64) to avoid
    /// hash collisions.
    #[must_use]
    #[expect(
        clippy::cast_possible_truncation,
        reason = "intentional: public API returns 32-bit fingerprint"
    )]
    pub fn id(&self) -> u32 {
        self.id as u32
    }

Comment thread src/compression/zstd_pure.rs Outdated
Comment thread src/compression/zstd_pure.rs Outdated
- Fix module-level doc: remove reference to set_dictionary_from_bytes
  (implementation uses set_dictionary with a parsed Dictionary)
- Reuse the exhausted Cursor<Vec<u8>> source buffer across compress() calls
  via take_source().map_or_else(...) to avoid a per-block O(data.len())
  allocation after the first call on a given thread; drain ownership is
  transferred to the caller so its capacity cannot be recovered without
  an extra copy
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/compression/zstd_pure.rs`:
- Line 130: Extract the duplicated DICT_MAGIC array into a single module-level
constant and use it wherever the inline definition appears: add a top-level
const DICT_MAGIC: [u8; 4] = [0x37, 0xA4, 0x30, 0xEC] (with doc comment) near the
file imports, then remove the inline definitions in the functions (e.g., the
current local DICT_MAGIC in decompress_with_dict) and replace their uses with
the new module-level DICT_MAGIC constant so the magic value is defined once.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9739d26c-0a93-4be1-baaf-f423a58fe9c8

📥 Commits

Reviewing files that changed from the base of the PR and between 7650e6b and 1aec535.

📒 Files selected for processing (1)
  • src/compression/zstd_pure.rs

Comment thread src/compression/zstd_pure.rs Outdated
@polaz
Copy link
Copy Markdown
Member Author

polaz commented Apr 7, 2026

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 7, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/compression/zstd_pure.rs`:
- Around line 146-159: The test suite is missing a cross-backend regression test
for raw-content dictionaries: add a test that creates a raw-content dictionary
(use Dictionary::from_raw_content with the synthetic id logic based on dict_key
and DICT_MAGIC/Dictionary::decode_dict semantics), compresses data using the
pure backend, then attempts to decompress the resulting frame using the FFI
backend to assert successful round-trip; ensure the test exercises the path
where dict_raw does not start with DICT_MAGIC so the synthetic id branch (the id
computed from dict_key) is used, and verify decompression succeeds despite the
differing dictID treatment between backends.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 027cfd5b-791d-4f90-a206-b3c0f3d2f603

📥 Commits

Reviewing files that changed from the base of the PR and between ab61d33 and 59bd591.

📒 Files selected for processing (7)
  • .github/workflows/benchmark.yml
  • .github/workflows/coordinode-ci.yml
  • README.md
  • src/compression/mod.rs
  • src/compression/zstd_pure.rs
  • src/config/mod.rs
  • tests/zstd_dict_roundtrip.rs
💤 Files with no reviewable changes (2)
  • README.md
  • src/config/mod.rs

Comment thread src/compression/zstd_pure.rs
The pure Rust backend (structured-zstd) rejects dictID=0 for
Dictionary::from_raw_content, so it internally assigns a synthetic
non-zero ID derived from xxh3_64(dict_raw). The C zstd library always
records dictID=0 (absent) for raw-content dicts, causing two failures:

  pure → FFI: pure embeds synthetic dictID in frame header;
  libzstd ZSTD_DDict has id=0; decompressor checks
  dctx->dictID(0) != fParams.dictID(synthetic) → "Dictionary mismatch".

  FFI → pure: FFI records dictID=0 in frame; structured-zstd
  treats dictID=0 as "no dict required" and skips dict lookup,
  then fails decompression with NotEnoughBytesInDictionary.

Fix the pure backend in both directions:

  compress_with_dict (raw-content): after compression, strip the
  Dict_ID field from the zstd frame header via strip_dict_id().
  The synthetic ID is an internal detail — removing it aligns the
  output with the C convention (dictID absent/0), so the C FFI
  decompressor skips the ID check and accepts the frame.

  decompress_with_dict (raw-content): replace decode_all_to_vec
  (which calls init() internally and skips dict loading when
  dictID=0) with the manual init() → force_dict() → decode_blocks()
  → collect() flow. force_dict() loads the raw-content dict
  regardless of the frame's dictID field, handling all three cases:
    - Frame from C FFI backend (dictID absent): force_dict loads dict.
    - Frame from new pure backend (dictID stripped): same.
    - Frame from old pure backend (dictID=synthetic): force_dict
      reloads same dict (idempotent).

Add cross-backend interoperability tests gated on
#[cfg(all(test, feature = "zstd", feature = "zstd-pure"))]:
  pure_compress_ffi_decompress_raw_content_dict_roundtrip
  ffi_compress_pure_decompress_raw_content_dict_roundtrip

mod zstd_pure is now compiled under #[cfg(feature = "zstd-pure")]
(previously not(feature = "zstd")) so it is visible to the
cross-backend test module when --all-features is used.

Closes #218
Copy link
Copy Markdown

@sw-release-bot sw-release-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'lsm-tree db_bench'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.

Benchmark suite Current: ffd9512 Previous: ab61d33 Ratio
mergerandom 261870.7610778383 ops/sec (normalized) 380260.6261186645 ops/sec (normalized) 1.45

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/compression/zstd_pure.rs`:
- Around line 151-174: The TLS variables (e.g., TLS_COMPRESSOR holding
Option<(u64, i32, CachedCompressor)>) implement a single-entry memoizer contrary
to the comment; replace the Option with a small keyed cache (HashMap or tiny
LRU) keyed by (dict_hash: u64, level: i32) so a thread can store multiple
CachedCompressor entries and avoid re-parsing when switching
dictionaries/levels; update the thread_local! RefCell<Option<...>> declarations
to RefCell<HashMap<(u64, i32), CachedCompressor>> or a bounded LRU map and apply
the same change to the analogous TLS cells referenced in the diff (the decoder
TLS instances), ensuring lookups insert/evict correctly and keys use
dict.id64()/level as described in the comments.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6b0df8e2-09ae-4eda-a5fd-d5d8f7ca0431

📥 Commits

Reviewing files that changed from the base of the PR and between 59bd591 and 9ce1517.

📒 Files selected for processing (2)
  • src/compression/mod.rs
  • src/compression/zstd_pure.rs

Comment thread src/compression/zstd_pure.rs
@polaz polaz requested a review from Copilot April 7, 2026 21:25
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Comment thread tests/zstd_dict_roundtrip.rs Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/compression/mod.rs`:
- Around line 110-111: The doc comment for the handle is incorrect: `new()`
stores the full 64-bit xxh3 fingerprint in `self.id` (see `id64()`), while
`id()` truncates to the lower 32 bits; update the comment to state that the
handle stores the full 64‑bit xxh3 hash of `raw` and that `id()` returns the
lower 32 bits on demand (referencing `ZstdDictionary::id` and `id64()` for
clarity).

In `@tests/zstd_dict_roundtrip.rs`:
- Around line 269-272: The test currently masks failures by calling
tree.level_table_count(0).unwrap_or(0); instead, call level_table_count(0) and
assert on its success and result explicitly: retrieve the Result via
.expect(...) or .unwrap() with a clear message (e.g., on the call to
level_table_count), bind it to a variable (e.g., count) and then assert_eq!(0,
count, "L0 must be empty after major_compact — compaction may not have run");
this ensures failures to fetch the L0 count fail the test rather than being
coerced to zero.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0b978752-95b1-42f5-bebc-a380b250c4d8

📥 Commits

Reviewing files that changed from the base of the PR and between ef55325 and dd0275f.

📒 Files selected for processing (4)
  • .github/workflows/coordinode-ci.yml
  • src/compression/mod.rs
  • src/compression/zstd_pure.rs
  • tests/zstd_dict_roundtrip.rs

Comment thread src/compression/mod.rs Outdated
Comment thread tests/zstd_dict_roundtrip.rs
- Fix test postcondition: assert_eq!(Some(0), level_table_count(0)) instead
  of unwrap_or(0) — properly distinguishes "L0 exists and is empty" from
  unexpected None (Copilot #54, CodeRabbit #56)
- Fix ZstdDictionary::new() doc: handle stores full 64-bit xxh3 hash
  internally; id() returns lower 32 bits on demand (CodeRabbit #55)
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment thread src/compression/zstd_pure.rs
Comment thread Cargo.toml
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

@polaz polaz merged commit abd0ee5 into main Apr 9, 2026
23 checks passed
@polaz polaz deleted the feat/#218-featcompression-enable-dictionary-compression-in-p branch April 9, 2026 10:16
@sw-release-bot sw-release-bot Bot mentioned this pull request Apr 9, 2026
polaz pushed a commit that referenced this pull request Apr 9, 2026
## 🤖 New release

* `coordinode-lsm-tree`: 4.3.1 -> 4.4.0

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

##
[4.4.0](v4.3.1...v4.4.0)
- 2026-04-09

### Added

- *(compression)* enable dictionary compression in pure Rust backend
([#229](#229))

### Performance

- *(compression)* cache pre-compiled Dictionary across block decompress
calls
([#227](#227))
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/release-plz/release-plz/).

Co-authored-by: sw-release-bot[bot] <255865126+sw-release-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(compression): enable dictionary compression in pure Rust backend

2 participants